The data used in this analysis is the results of comparing the telomere end-chromosome length assaying tools TECAT and Telogator, along with the short read analysis tool Telseq. The data used in this analysis is mostly 1000 genomes project data. The data used in this analysis is 1000 genome project samples HG002, HG003, and HG00731 from the 1000 genomes project. The data was processed using the TECAT, Telogator, and Telseq tools. I chose these samples because they had good matched long-read and short-read data.
The purpose of this analysis is to compare the results of the TECAT and Telogator tools to the Telseq tool.
The data used in this analysis is the results of comparing the telomere end-chromosome length assaying tools TECAT and Telogator, along with the short read analysis tool Telseq. Telomere lengths are important in determining FCR treatment response in CLL patients and have been shown to be a prognostic marker in CLL and other cancers. Telomere’s are also implicated in aging and neurodegenerative diseases. Normal telomere length varies from 1kb to 15kb depending on age. As of now there are two software packages which measure telomere length at the resolution of individual chromosomes: TECAT and Telogator. Telseq is a short read analysis tool that can be used to measure telomere length. The purpose of including Telseq in this analysis is to provide some type of ground truth for the other two tools.
# Create the data frame, including ineligible rows with a note
data <- data.frame(
Number = c(1, 2, 3, 4, 5, 6, 7, 8, 9, "-10", "-11", "-12", 10, 11, 12),
SampleID = c("HG002", "HG003", "HG004", "HG00731", "HG00732", "HG02011",
"HG02492", "HG03065", "HG03371", "HG03683", "NA12329",
"NA19983", "HG00514", "HG00733", "NA19240"),
Long_Read_Accession = c("SRR28295757-71", "SRR12898316", "SRR27010837",
"ERR11586165", "ERR4987503-05", "ERR11028132",
"ERR11028099", "ERR11028096", "ERR11028137",
"ERR3861403", "ERR3861409-10", "ERR3861402",
"ERR4982327", "SRR24401966", "ERR3219853-57"),
Short_Read_Accession = c("SRR14724532", "SRR14724530", "SRR14724529",
"ERR10967205", "ERR3241755", "ERR3988973",
"ERR3989019", "ERR3989118", "ERR3989162",
"ERR3989199", "ERR3989317", "ERR3989454",
"ERR3988781", "SRR5535410-11", "ERR3989410"),
LR_Type = c("ONT", "ONT", "ONT", "ONT", "PB", "ONT", "ONT", "ONT",
"ONT", "PB", "PB", "PB", "PB", "ONT", "ONT"),
Notes = c("", "", "", "updated SR SRR5534404-5", "", "", "", "", "",
"CLR - not eligible", "CLR - not eligible", "CLR - not eligible", "", "", "")
)
# Print the data frame
knitr::kable(data, caption = "20240801 List of Data Sets for Comparative Study")| Number | SampleID | Long_Read_Accession | Short_Read_Accession | LR_Type | Notes |
|---|---|---|---|---|---|
| 1 | HG002 | SRR28295757-71 | SRR14724532 | ONT | |
| 2 | HG003 | SRR12898316 | SRR14724530 | ONT | |
| 3 | HG004 | SRR27010837 | SRR14724529 | ONT | |
| 4 | HG00731 | ERR11586165 | ERR10967205 | ONT | updated SR SRR5534404-5 |
| 5 | HG00732 | ERR4987503-05 | ERR3241755 | PB | |
| 6 | HG02011 | ERR11028132 | ERR3988973 | ONT | |
| 7 | HG02492 | ERR11028099 | ERR3989019 | ONT | |
| 8 | HG03065 | ERR11028096 | ERR3989118 | ONT | |
| 9 | HG03371 | ERR11028137 | ERR3989162 | ONT | |
| -10 | HG03683 | ERR3861403 | ERR3989199 | PB | CLR - not eligible |
| -11 | NA12329 | ERR3861409-10 | ERR3989317 | PB | CLR - not eligible |
| -12 | NA19983 | ERR3861402 | ERR3989454 | PB | CLR - not eligible |
| 10 | HG00514 | ERR4982327 | ERR3988781 | PB | |
| 11 | HG00733 | SRR24401966 | SRR5535410-11 | ONT | |
| 12 | NA19240 | ERR3219853-57 | ERR3989410 | ONT |
## ── Attaching core tidyverse packages ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::arrange() masks plyr::arrange()
## ✖ purrr::compact() masks plyr::compact()
## ✖ dplyr::count() masks plyr::count()
## ✖ dplyr::desc() masks plyr::desc()
## ✖ dplyr::failwith() masks plyr::failwith()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::id() masks plyr::id()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::mutate() masks plyr::mutate()
## ✖ dplyr::rename() masks plyr::rename()
## ✖ dplyr::summarise() masks plyr::summarise()
## ✖ dplyr::summarize() masks plyr::summarize()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'cowplot'
##
## The following object is masked from 'package:lubridate':
##
## stamp
samples <- c("HG002", "HG004", "HG00731", "HG00732", "HG02011", "HG02492", "HG03065")
tecat <- list()
telogator <- list()
telseq <- list()
for(s in samples) {
tecat[[s]] <- readRDS(file = file.path(paste0(s), "tecat", "mapped.rds"))
telogator[[s]] <- read.table(file = file.path(paste0(s), "telogator", "tlens_by_allele.tsv"), header = TRUE, sep = "\t", comment.char = "")
telseq[[s]] <- readLines(file.path(paste0(s), "telseq", "telseq_results.txt"))
}
head(tecat$HG002$results)## telo_name ref_name strand start end qwidth score telomere_length read_length trunc_start trunc_end telomere_end chromEnd sample
## 1 split_014_telomere_2 chr10 + 3157 17222 14066 10 7733 21799 7734 21799 L 5' HG002
## 2 split_080_telomere_7 chr10 + 3157 41029 37873 4 6233 44106 6234 44106 L 5' HG002
## 3 split_131_telomere_5 chr10 - 3162 29229 26068 6 7851 33919 1 26068 R 5' HG002
## 4 split_016_telomere_10 chr10 - 3180 4747 1568 43 4502 6070 1 1568 R 5' HG002
## 5 split_130_telomere_10 chr10 - 3181 39648 36468 4 6370 42838 1 36468 R 5' HG002
## 6 split_023_telomere_10 chr10 - 3187 25768 22582 1 4719 27301 1 22582 R 5' HG002
## [1] 92 11
## [1] 1765 14
## [1] "X.chr" "position" "ref_samp" "allele_id" "TL_p75" "read_TLs" "read_lengths" "read_mapq" "tvr_len" "tvr_consensus" "supporting_reads"
## X.chr position ref_samp allele_id TL_p75 read_TLs
## 1 chr1p 3311 t2t-hg002-mat 47+0i 2238 1078,1345,1483,1503,1555,1570,1574,1654,1672,1698,1716,1788,2111,2183,2294,2467,2614,2647,10493
## 2 chr1p 4321 t2t-hg002-pat 86+0i 8667 279,3709,6650,7945,7947,7978,8102,8444,8623,8801,8952,8991
## 3 chr1q 495845 t2t-hg002-mat 41+0i 4684 -2031,-613,3388,3630,3686,3947,4022,4295,4322,4402,4405,4500,4519,4527,4640,4684,4805,4839,4945,8355,9483
## 4 chr1q 495897 t2t-hg002-pat 77+0i 5540 2899,3079,3335,3559,3795,4443,4541,4762,4957,5129,5267,5360,6082,6611,8345,12261
## 5 chr2p 2974 t2t-hg002-pat 19+0i 4506 3678,3788,3802,3828,3837,3864,4277,4339,4441,4453,4524,4557,5051,6850
tecat_lengths <- list()
telogator_lengths <- list()
telseq_length <- list()
for(s in samples) {
tecat_lengths[[s]] <- tecat[[s]]$results %>%
filter(!is.na(telomere_length)) %>%
pull(telomere_length)
telogator_lengths[[s]] <- as.numeric(unlist(strsplit(telogator[[s]]$read_TLs, ",")))
telseq_length[[s]] <- as.numeric(unlist(strsplit(telseq[[s]][4], "\t"))[7])
}
histo_data <- list()
for(s in samples) {
histo_data[[s]] <- data.frame(
length = c(tecat_lengths[[s]], telogator_lengths[[s]]),
tool = c(rep("TECAT", length(tecat_lengths[[s]])), rep("TELOGATOR", length(telogator_lengths[[s]])))
)
# Find mean for each tool
mu <- ddply(histo_data[[s]], "tool", summarise, grp.mean=mean(length))
hist <- ggplot(histo_data[[s]], aes(x = length, fill = tool, color = tool)) +
geom_histogram(aes(y=after_stat(density)), alpha=0.5,
position="identity", binwidth = 200)+
geom_density(alpha=.2) +
geom_vline(data = mu, aes(xintercept = grp.mean, color = tool), linetype = "dashed", linewidth = 1) +
labs(title = paste("Histogram of telomere lengths, sample", s),
x = "Telomere length (bp)",
y = "Count") +
theme_minimal()
print(hist)
ggsave(file = file.path("./figures", paste0(s, "_histogram.png")), plot = hist)
}## Saving 7 x 5 in image
## Saving 7 x 5 in image
## Saving 7 x 5 in image
## Saving 7 x 5 in image
## Saving 7 x 5 in image
## Saving 7 x 5 in image
## Saving 7 x 5 in image
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
# Combine histo datas for both samples with sample labels
boxplot <- rbind(
cbind(histo_data[["HG002"]], sample = "HG002"),
cbind(histo_data[["HG004"]], sample = "HG004"),
cbind(histo_data[["HG00731"]], sample = "HG00731"),
cbind(histo_data[["HG00732"]], sample = "HG00732"),
cbind(histo_data[["HG02011"]], sample = "HG02011"),
cbind(histo_data[["HG02492"]], sample = "HG02492"),
cbind(histo_data[["HG03065"]], sample = "HG03065")
)
# Add telseq data to boxplot
boxplot <- rbind(
boxplot,
data.frame(
length = unlist(telseq_length) * 1000,
tool = rep("TELSEQ", length(unlist(telseq_length))),
sample = rep(samples, each = 1)
)
)
ggplot(boxplot, aes(x = sample, y = length, fill = tool)) +
geom_boxplot(outlier.shape = NA) +
labs(title = "Boxplot of telomere lengths",
x = "Sample",
y = "Telomere length (bp)") +
theme_minimal()# Create a plot that shows the difference between each method and the telseq method/tool
# Calculate the mean of each method
tecat_means <- sapply(tecat_lengths, mean)
telogator_means <- sapply(telogator_lengths, mean)
telseq_means <- sapply(telseq_length, mean)
means <- data.frame(
sample = samples,
tecat = tecat_means,
telogator = telogator_means,
telseq = telseq_means
)
diff <- data.frame(
sample = samples,
tecat = means$tecat - means$telseq * 1000,
telogator = means$telogator - means$telseq * 1000
)
ggplot(melt(diff), aes(x = sample, y = value, fill = variable)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Difference in telomere lengths",
x = "Sample",
y = "Difference in telomere length (bp)") +
theme_minimal()## Using sample as id variables
## tecat telogator
## 1 1105.762 3076.021
## tecat telogator
## 1 7740.333 21532.14
tecat_time <- list()
telogator_time <- list()
for(s in samples) {
tecat_time[[s]] <- as.numeric(strsplit(grep("TECAT time taken", readLines(file.path("../code/logs", paste0(s, ".log"))), value = TRUE), ":")[[1]][2])
telogator_time[[s]] <- as.numeric(strsplit(grep("Telogator time taken", readLines(file.path("../code/logs", paste0(s, ".log"))), value = TRUE), ":")[[1]][2])
}
time_data <- data.frame()
for(s in samples) {
time_data <- rbind(
time_data,
data.frame(
time = c(tecat_time[[s]], telogator_time[[s]]),
tool = c(rep("TECAT", 1), rep("TELOGATOR", 1)),
sample = c(rep(s, 2))
)
)
}
# Summarise data
time_data %>%
group_by(tool) %>%
summarise(mean_time = mean(time), sd_time = sd(time))## # A tibble: 2 × 3
## tool mean_time sd_time
## <chr> <dbl> <dbl>
## 1 TECAT 5189. 2861.
## 2 TELOGATOR 9926. 6808.
# Convert time from seconds to minutes and label minutes
time_data$time <- time_data$time / 60
# Summarise data again
time_data %>%
group_by(tool) %>%
summarise(mean_time = mean(time), sd_time = sd(time))## # A tibble: 2 × 3
## tool mean_time sd_time
## <chr> <dbl> <dbl>
## 1 TECAT 86.5 47.7
## 2 TELOGATOR 165. 113.
ggplot(time_data, aes(x = tool, y = time, fill = tool)) +
geom_violin() +
geom_boxplot(width = 0.1) +
labs(title = "Time taken to run each tool with 18 threads",
x = "Tool",
y = "Time taken (minutes)") +
theme_minimal()# Turn warnings off
options(warn=-1)
for (s in samples) {
mapped_output <- tecat[[s]]
left <- mapped_output[["results"]][mapped_output[["results"]]$chromEnd == "5'", ]
ord <- order(as.numeric(stringr::str_extract(left$ref_name, "\\d+")))
left <- left[ord, ]
left$ref_name <- factor(left$ref_name, levels = unique(left$ref_name))
right <- mapped_output[["results"]][mapped_output[["results"]]$chromEnd == "3'", ]
ord <- order(as.numeric(stringr::str_extract(right$ref_name, "\\d+")))
right <- right[ord, ]
right$ref_name <- factor(right$ref_name, levels = unique(right$ref_name))
top <- ggplot(left, aes(x = ref_name, y = telomere_length / 1000, fill = chromEnd)) +
geom_violin(drop = FALSE) +
geom_boxplot(width = 0.1) +
scale_fill_manual(values = c("#0091ff")) +
labs(x = "Chromosome",
y = "Telomere Length (kb)",
title = paste("TECAT Lengths by Chromosome End 5'", s)) +
theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1)) +
theme(text = element_text(size = 18, family = "Arial"))
bottom <- ggplot(right, aes(x = ref_name, y = telomere_length / 1000, fill = chromEnd)) +
geom_violin(drop = FALSE) +
geom_boxplot(width = 0.1) +
scale_fill_manual(values = c("#ff00b7")) +
labs(x = "Chromosome",
y = "Telomere Length (kb)",
title = paste("TECAT Lengths by Chromosome End 3'", s)) +
theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1)) +
theme(text = element_text(size = 18, family = "Arial"))
grid <- plot_grid(top, bottom, ncol = 1)
print(grid)
ggsave(file = file.path("./figures", paste0(s, "_tecat_violin.png")), plot = grid)
# Do the same for Telogator
telogator_output <- telogator[[s]]
# Split to p and q arms by rows in the data frame
# Turn telogator output into a long form data frame
# Isolate just the variables we need
telogator_output <- telogator_output[, c("X.chr", "read_TLs")]
telogator_output <- melt(telogator_output, id.vars = "X.chr", value.name = "read_TLs")
gator_output <- data.frame()
for (r in 1:nrow(telogator_output)) {
key <- telogator_output$X.chr[r]
values <- unlist(strsplit(telogator_output$read_TLs[r], ","))
for (v in values) {
gator_output <- rbind(gator_output, data.frame(X.chr = key, read_TLs = as.numeric(v))
)
}
}
p <- gator_output[grepl("p", gator_output$X.chr), ]
p[["end"]] <- "p"
q <- gator_output[grepl("q", gator_output$X.chr), ]
q[["end"]] <- "q"
top <- ggplot(p, aes(x = X.chr, y = read_TLs / 1000, fill = end)) +
geom_violin(drop = FALSE) +
geom_boxplot(width = 0.1) +
scale_fill_manual(values = c("#0091ff")) +
labs(x = "Chromosome",
y = "Telomere Length (kb)",
title = paste("Telogator Lengths by Chromosome End p Arm", s)) +
theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1)) +
theme(text = element_text(size = 18, family = "Arial"))
bottom <- ggplot(q, aes(x = X.chr, y = read_TLs / 1000, fill = end)) +
geom_violin(drop = FALSE) +
geom_boxplot(width = 0.1) +
scale_fill_manual(values = c("#ff00b7")) +
labs(x = "Chromosome",
y = "Telomere Length (kb)",
title = paste("Telogator Lengths by Chromosome q Arm", s)) +
theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1)) +
theme(text = element_text(size = 18, family = "Arial"))
grid <- plot_grid(top, bottom, ncol = 1)
print(grid)
ggsave(file = file.path("./figures", paste0(s, "_telogator_violin.png")), plot = grid)
}## Saving 16 x 10 in image
## Saving 16 x 10 in image
## Saving 16 x 10 in image
## Saving 16 x 10 in image
## Saving 16 x 10 in image
## Saving 16 x 10 in image
## Saving 16 x 10 in image
## Saving 16 x 10 in image
## Saving 16 x 10 in image
## Saving 16 x 10 in image
## Saving 16 x 10 in image
## Saving 16 x 10 in image
## Saving 16 x 10 in image
## Saving 16 x 10 in image
The TECAT and Telogator tools produced similar results, with the TECAT tool producing slightly more accurate results based on the results of the Telseq tool and literature data. The TECAT tool also took less time to run than the Telogator tool.